import requests
payload = {
'api_key': 'API_KEY',
'query': 'iphone 15 charger',
's': 'price-asc-rank'
}
response = requests.get('https://api.scraperapi.com/structured/amazon/search',
params=payload).json()Data Prep / EDA
Where the data source, processing, and visualization (EDA) is presented.
Data Collection
Amazon product information was scraped from the website using the API service ScraperAPI; this is because, as Amazon is a hugely popular website, they have many anti-scraping measures in place such as rate-limiting, IP blocking, dymamic loading, and such. Using the external API service, these limitations were able to be avoided. The search queries chosen to search for items were based on top 100 Amazon searches, found on this site and this site. An example of using the API, along with its core endpoint, is below.
The jupyter notebook code for the web scraping can be found here.
Additionally, more data was used to supplement the existing data. Since the scraped data was only about 26K rows, a Kaggle dataset was used that contains more than one million rows, had around the same fields as the scraped data, and was also from the USA (many Amazon Kaggle datasets were from the non-US).
The raw data from both sources can be seen below in Table 1; the scraped raw data CSV can also be viewed here.
| type | position | asin | name | image | has_prime | is_best_seller | is_amazon_choice | is_limited_deal | stars | total_reviews | url | availability_quantity | spec | price_string | price_symbol | price | original_price | section_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | search_product | 32 | B0CV5Z91QR | Better Home Products Megan Wooden 6 Drawer Dou... | https://m.media-amazon.com/images/I/81fL4qHZ4i... | False | False | False | False | NaN | NaN | https://www.amazon.com/Better-Home-Products-Wo... | NaN | {} | NaN | NaN | NaN | NaN | NaN |
| 1 | search_product | 40 | B0CTZQ73SH | FixtureDisplays® Bathtub and Shower Tension St... | https://m.media-amazon.com/images/I/51F0eGaPI+... | False | False | False | False | NaN | NaN | https://www.amazon.com/FixtureDisplays%C2%AE-B... | NaN | {} | $22.60 | $ | 22.60 | NaN | NaN |
| 2 | search_product | 15 | B0CR633W5W | 4-Port USB 3.0 Hub Ultra-Thin Expand PC Connec... | https://m.media-amazon.com/images/I/51sJxlimwP... | False | False | False | False | NaN | NaN | https://www.amazon.com/4-Port-Ultra-Thin-Expan... | NaN | {} | $3.99 | $ | 3.99 | NaN | NaN |
| 3 | search_product | 15 | B0CTQSPM8G | MK000960GWSSD 960GB SATA 6G MU SFF SC DS SSD-1... | https://m.media-amazon.com/images/I/11rdwQUxu0... | False | False | False | False | NaN | NaN | https://www.amazon.com/Genuine-Original-MK0009... | NaN | {} | $375.00 | $ | 375.00 | NaN | NaN |
| 4 | search_product | 44 | B0CJ9ZF7LQ | SOL DE JANEIROCheirosa '62 Hair & Body Fragran... | https://m.media-amazon.com/images/I/31mCVvwGqC... | False | False | False | False | 4.6 | 14.0 | https://www.amazon.com/SOL-JANEIRO-Cheirosa-Fr... | NaN | {} | $62.00 | $ | 62.00 | NaN | NaN |
| asin | title | imgUrl | productURL | stars | reviews | price | listPrice | category_id | isBestSeller | boughtInLastMonth | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B01CR9B6UY | DC Cargo E-Track Ratcheting Straps Cargo Tie-D... | https://m.media-amazon.com/images/I/81KFTHU7xx... | https://www.amazon.com/dp/B01CR9B6UY | 4.7 | 593 | 36.99 | 0.00 | 24 | False | 100 |
| 1 | B0BWFRDBBW | Little Girls Glitter Tulle Dress, Sparkle Polk... | https://m.media-amazon.com/images/I/711QwZxcW8... | https://www.amazon.com/dp/B0BWFRDBBW | 3.9 | 7 | 46.99 | 0.00 | 91 | False | 0 |
| 2 | B00XV8CQGO | Club Little Girl Summer Animal Collection Stic... | https://m.media-amazon.com/images/I/41wQCnt7+E... | https://www.amazon.com/dp/B00XV8CQGO | 0.0 | 0 | 9.99 | 0.00 | 95 | False | 50 |
| 3 | B08MBJX3K1 | Toddler Boys Girls Sneakers Size 5-12 Lightwei... | https://m.media-amazon.com/images/I/81IscOXT6S... | https://www.amazon.com/dp/B08MBJX3K1 | 4.5 | 0 | 31.99 | 35.99 | 97 | False | 0 |
| 4 | B08YKJDWLF | CFS Cut to Fit Carbon Pad Pre Filter Roll for ... | https://m.media-amazon.com/images/I/61j4Gpu4jH... | https://www.amazon.com/dp/B08YKJDWLF | 4.6 | 0 | 12.59 | 0.00 | 171 | False | 0 |
Data Cleaning
The datasets were cleaned seperately, then concatenated, then some final steps were taken to clean it.
The steps to clean the web-scaped data were:
- Add
date_scrapedcolumn - Remove unecessary columns:
type,position,has_prime,is_amazon_choice,is_limited_deal,availability_quantity,spec,price_string,price_symbol,section_name - Expand and fix
original_price - Rename columns to match standard snake case for merging both datasets
- Drop rows with no asin or name or price
- Drop rows with price of 0.0, since that doesn’t make sense
- Fill NaN
reviewscolumn with 0
The steps to clean the Kaggle data were:
- Add
date_scrapedcolumn - Drop rows with any NaNs
- Fix
list_priceof $0 to be instead equal toprice - Change
category_idto actual category by usingcategorytable - Drop rows with price of $0, since that doesn’t make sense
- Rename columns to match standard snake case for merging both datasets
And then, after they were concatenated, the steps to clean were:
- Remove duplicates (by asin + date scraped)
- Rename columns
The final cleaned (and concatenated) dataset can be seen in Table 2 (with the original raw data in Table 1):
| Asin | Name | Image Url | Is Best Seller | Stars | Reviews | Url | Price | Date Scraped | List Price | Bought In Month | Category | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B077H6599Q | Generac 7103 Cold Kit for 9kW - 22kW Air Coole... | https://m.media-amazon.com/images/I/71+HddWhsK... | False | 4.5 | 0.0 | https://www.amazon.com/dp/B077H6599Q | 94.99 | 2023-11-01 | 105.99 | 0.0 | RV Parts & Accessories |
| 1 | B095HLD52Z | 3 Pack Apple MFi Certified iPhone Charger Cabl... | https://m.media-amazon.com/images/I/61rqFEt6ku... | False | 4.5 | 0.0 | https://www.amazon.com/dp/B095HLD52Z | 20.99 | 2023-11-01 | 20.99 | 0.0 | Accessories & Supplies |
| 2 | B08XK4CBP8 | 164pcs Blush White Balloons Garland Arch Kit P... | https://m.media-amazon.com/images/I/71oD9RPq84... | False | 4.5 | 0.0 | https://www.amazon.com/dp/B08XK4CBP8 | 14.99 | 2023-11-01 | 16.99 | 50.0 | Party Decorations |
| 3 | B0C1Z9G9J9 | Mudtun Bone Conduction Earbuds for Small Ear C... | https://m.media-amazon.com/images/I/51js5mISHA... | False | 3.5 | 74.0 | https://www.amazon.com/dp/B0C1Z9G9J9 | 19.99 | 2023-11-01 | 19.99 | 0.0 | Headphones & Earbuds |
| 4 | B0C5NB4SRN | V-SHURA Music Gifts Table Lamp, Guitar Table L... | https://m.media-amazon.com/images/I/710DKiJYTV... | False | 4.8 | 0.0 | https://www.amazon.com/dp/B0C5NB4SRN | 55.99 | 2023-11-01 | 55.99 | 0.0 | Home Lighting & Ceiling Fans |
The code for the data cleaning can be found here.
Data Preprocessing / Visualization
Various types of EDA were performed in order to examine the data; as a note, most visuals are interactive (zoomable, pannable, etc). The code for all visualizations can be found here.
If the interactive figures don’t load (is blank screen space with caption showing), dont worry: just turn off all ad-blockers/privacy browsing, make sure that Chrome/Firefox is being used, and refresh the page (or close and re-open the website) until all figures load.
First, a histogram of all categories of all Amazon products is shown in Figure 1. Note scraped data did not have categories, but the Kaggle data did. It can be seen that from the figure, girl’s and boy’s clothing are the most populous categories, with toys & games the next most populous.